Directions

The semester project gives you the chance to apply the knowledge and skills you will learn in the course in a way that mimics the ways you can expect to use logistic regression modeling in a real-world setting.

For the project, you will choose one data set from the two listed below and perform a logistic regression analysis. Specifically, you will build a regression model and report on the model statistics and diagnostics. Your final deliverable for the project will be an R Markdown file. The file will contain the analyses detailed in each step with 5-7 written bullet points. Basically, pretend you are presenting to a non-technical audience, and the bullet points would serve as a script outline for your explanation. By non-technical I mean an audience who knows things like “a p-value less than 0.05 mean statistical significance” but cannot really explain the underlying concepts in great detail.

You can choose one of two different data sets to complete the project, either the wine quality data set first analyzed during week 3 or the data set diabetes from the faraway package, which we do not cover elsewhere in the course. Here are the details for each. Remember, you just need to choose one, not both.

Load Libraries

library(tidyverse)
library(ROCR) # <-- for AUC 
library(ResourceSelection) # <-- for Hosmer-Lemeshow Goodness of Fit Test
library(splines) # <-- for splines 

Read and Prepare Dataset

wine_red <- read_delim('winequality-red.csv',delim = ';') %>% 
                    mutate(quality=factor(ifelse(quality>5,1,0),labels = c('low','high')))
names(wine_red)<-make.names(names(wine_red),unique = TRUE)

Introduction

I am analyzing the red wine dataset for my final project. This dataset was collected in 2009 and includes objective sensor data and a subjective quality measure about Vinho Verde style red wine from Northern Portugal. The data has 1599 rows and 11 physicochemical predictors and a single quality score from 0-10 which is the median score of at least 3 evaluations collected from wine experts. For simplicity I will be predicting if the score is high (greater than 5) or low (5 or less). The predictors are listed below :

  1. fixed acidity (tartaric acid - g / dm^3)
    • most acids involved with wine or fixed or nonvolatile (do not evaporate readily)
  2. volatile acidity (acetic acid - g / dm^3)
    • the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
  3. citric acid (g / dm^3)
    • found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. residual sugar (g / dm^3)
    • the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
  5. chlorides (sodium chloride - g / dm^3)
    • the amount of salt in the wine
  6. free sulfur dioxide (mg / dm^3)
    • the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  7. total sulfur dioxide (mg / dm^3)
    • amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. density (g / cm^3)
    • the density of wine is close to that of water depending on the percent alcohol and sugar content
  9. pH
    • describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. sulphates (potassium sulphate - g / dm3)
    • a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant
  11. alcohol (% by volume)
    • the percent alcohol content of the wine

Source for Variable Description

These Predictors will be used with Logistic Regression modeling techniques to predict the subjective quality of the wine.

Project Steps:

The project consists of five steps, which you will work on throughout the second half of the semester. The steps you complete each week will accumulate in your R Markdown file, due at the end of the course. You will receive credit for completing drafts of Steps 1- 5 according to the schedule below. Each of these steps are described in greater detail in the week they are due.

Step 1: Exploratory Data Analysis (EDA)

  • Task: Construct interleaved histograms and scatterplots to explore the relation between the predictors and response. Specifically, choose two predictors and make an interleaved histogram and scatterplot for each. Thus, you should construct four total graphs.

  • Deliverables:
    • Your code in Markdown should produce the plots.
    • Your 5-7 bullet points should explain the graphs including the axes and what each plot means. Remember, interleaved histograms may be easy for you but maybe not so for others.

Alcohol Percentage

ggplot(
  data = wine_red,
  aes(x=alcohol,fill=quality)
) + geom_histogram(position = 'dodge',bins = 30) +
  ggtitle("Histogram of Wine Quality by Alcohol Percent")+
  xlab("Alcohol %")

Total Sulfur Dioxide

ggplot(
  data = wine_red,
  aes(x=total.sulfur.dioxide,fill=quality)
) + geom_histogram(position = 'dodge',bins = 30) +
  ggtitle("Histogram of Wine Quality by Total Sulfur Dioxide") +
  xlab("Total Sulfur Dioxide")

Scatterplot of Alcohol Percentage and Quality

ggplot(
  data = wine_red,
  aes(x=total.sulfur.dioxide,y=quality,color=quality)
) + geom_jitter(alpha=.5) +
  ggtitle("Scatterplot of Wine Quality by Total Sulfur Dioxide") +
  xlab("Total Sulfur Dioxide")+
  ylab("Quality")

Scatterplot of Total Sulfur Dioxide and Quality

ggplot(
  data = wine_red,
  aes(x=alcohol,y=quality,color=quality)
) + geom_jitter(alpha=.5) +
  ggtitle("Scatterplot of Wine Quality by Alcohol %") +
  xlab("Alcohol %")+
  ylab("Quality")

Graph Explanations

  • Higher Alcohol Percentage is associated with better quality wines
  • Lower Total Sulfur Dioxide is associated with better quality wines

Step 2: Variable Selection

  • Task: Perform variable selection via the AIC using the step() function. Your starting model should include all available predictors. The reduced model should be used as your final model for all subsequent steps. Or, if you disagree with the recommended model, you need to indicate why.

  • Deliverables:
    • Your code in Markdown should show how you used the function, produce the results, and indicate your final model
    • Your bullet points should give a brief explanation of the algorithm and comment on the variables retained/removed from the model. Are the results intuitive?

Create Binary Response Variable

wine_red <- wine_red %>% 
               mutate(quality=ifelse(quality=='high',1,0))

Create Saturated Model

saturated_model <- glm(quality~.,family=binomial,data=wine_red)

Use step() to select paired down model

aic_step <- step(saturated_model)
## Start:  AIC=1679.63
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol
## 
##                        Df Deviance    AIC
## - pH                    1   1655.9 1677.9
## - density               1   1656.0 1678.0
## - residual.sugar        1   1656.7 1678.7
## - fixed.acidity         1   1657.5 1679.5
## <none>                      1655.6 1679.6
## - citric.acid           1   1660.8 1682.8
## - chlorides             1   1662.1 1684.1
## - free.sulfur.dioxide   1   1662.9 1684.9
## - total.sulfur.dioxide  1   1690.2 1712.2
## - sulphates             1   1698.8 1720.8
## - volatile.acidity      1   1705.9 1727.9
## - alcohol               1   1730.0 1752.0
## 
## Step:  AIC=1677.9
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + sulphates + alcohol
## 
##                        Df Deviance    AIC
## - density               1   1657.2 1677.2
## - residual.sugar        1   1657.5 1677.5
## <none>                      1655.9 1677.9
## - citric.acid           1   1661.2 1681.2
## - chlorides             1   1662.1 1682.1
## - fixed.acidity         1   1662.8 1682.8
## - free.sulfur.dioxide   1   1663.0 1683.0
## - total.sulfur.dioxide  1   1690.7 1710.7
## - sulphates             1   1701.1 1721.1
## - volatile.acidity      1   1706.8 1726.8
## - alcohol               1   1751.2 1771.2
## 
## Step:  AIC=1677.19
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     sulphates + alcohol
## 
##                        Df Deviance    AIC
## - residual.sugar        1   1657.8 1675.8
## <none>                      1657.2 1677.2
## - citric.acid           1   1662.5 1680.5
## - chlorides             1   1663.1 1681.1
## - fixed.acidity         1   1663.3 1681.3
## - free.sulfur.dioxide   1   1664.2 1682.2
## - total.sulfur.dioxide  1   1691.4 1709.4
## - sulphates             1   1701.1 1719.1
## - volatile.acidity      1   1713.3 1731.3
## - alcohol               1   1839.5 1857.5
## 
## Step:  AIC=1675.84
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + sulphates + 
##     alcohol
## 
##                        Df Deviance    AIC
## <none>                      1657.8 1675.8
## - citric.acid           1   1663.0 1679.0
## - chlorides             1   1663.5 1679.5
## - fixed.acidity         1   1664.2 1680.2
## - free.sulfur.dioxide   1   1665.2 1681.2
## - total.sulfur.dioxide  1   1691.4 1707.4
## - sulphates             1   1701.2 1717.2
## - volatile.acidity      1   1713.4 1729.4
## - alcohol               1   1843.8 1859.8

Final Model Selected by step()

formula(aic_step)
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides + 
##     free.sulfur.dioxide + total.sulfur.dioxide + sulphates + 
##     alcohol

Comments on Variables Selected:

  • How the algorithm works:
    • Step starts with the saturated model and decides iteratively which variables it can remove
    • At each step it computes the AIC for a model with each single predictor removed as well as one for no predictors removed
    • If the model with a predictor removed gives a better AIC than one with no predictor removed step removes the predictor that gives the lowest AIC model it repeats this process until the model reaches a number of predictors where removing no predictor gives the lowest AIC
  • Results:
    • The results show that pH, residual sugar and density are not good predictors of quality with red wine.
    • Also during the step process the removal of pH caused the fixed acidity variable to contribute more and become a relevant predictor since those variables are highly related.
    • Both of the variables I selected during my EDA section were relevant predictors and remained in the model

Step 3: Assess Model Fit

  • Task: Check that continuous predictors have a linear relation with the logit using loess plots and perform the Hosmer-Lemeshow (HL) goodness of fit test. If the loess plot is nonlinear, then splines should be used to account for the nonlinearity.

  • Deliverables:
    • Your code in Markdown should show how you used the function, produce the results, and indicate your final model
    • Your bullet points should explain the axes of the loess plot, the conclusion of the loess plot, the basic concepts of the HL test, and its result. If the HL test reveals poor fit, this should be discussed.

Loess Plot

loess_plot <- function(column_name){
  f <- as.formula(paste("quality", column_name, sep = " ~ "))
  y_smooth <- predict(loess(f, data=wine_red))
  zero_one <- which(y_smooth>0 & y_smooth<1)

  plot(jitter(wine_red[[column_name]])[zero_one],log(y_smooth[zero_one]/(1-y_smooth[zero_one])),
       xlab = column_name)
}

Alcohol

loess_plot('alcohol')

The lowess plot for Alcohol clearly shows around 2 splines around 9.5% and 12.5%

Volatile Acidity

loess_plot('volatile.acidity')

The lowess plot for Volatile Acidity does not show a strong need for splines

Sulphates

loess_plot('sulphates')

The lowess plot for Sulphates shows a clear need for a spline around 1

Total Sulfur Dioxide

loess_plot('total.sulfur.dioxide')

The lowess plot for Total Sulfur Dioxide shows a potential spline around 40 is needed

Free Sulfur Dioxide

loess_plot('free.sulfur.dioxide')

The lowess plot for Free Sulfur Dioxide shows splines around 10 and 15 could improve the model

Fixed Acidity

loess_plot('fixed.acidity')

The lowess plot for Fixed Acidity shows that splines around 8 and 12 are needed

Chlorides

loess_plot('chlorides')

The lowess plot for Chlorides shows a spline around .09 should be added

Citric Acid

loess_plot('citric.acid')

The lowess plot for Citric Acid shows that splines around .2 and .5 should be added

Adding Splines

spline_formula <- formula(quality~ns(citric.acid,4)+
                                  ns(chlorides,2)+
                                  ns(fixed.acidity,2)+
                                  ns(free.sulfur.dioxide,4)+
                                  ns(total.sulfur.dioxide,3)+
                                  ns(sulphates,4)+
                                  volatile.acidity+
                                  ns(alcohol,4))
step_with_splines_model <- glm(spline_formula,family=binomial,data=wine_red)

Hosmer-Lemeshow Goodness of Fit Test

Model without Splines

hl_test <- hoslem.test(wine_red$quality,aic_step$fitted.values,10)
hl_test
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  wine_red$quality, aic_step$fitted.values
## X-squared = 16.205, df = 8, p-value = 0.03954

Model with Splines

hl_test <- hoslem.test(wine_red$quality,step_with_splines_model$fitted.values,10)
hl_test
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  wine_red$quality, step_with_splines_model$fitted.values
## X-squared = 10.983, df = 8, p-value = 0.2026

Comments on Loess Plot and HL GoF Test:

  • Most of the Loess Plots show that the predictors are nonlinear except Volatile Acidity and thus splines need to be used to create multiple regression lines for each of the remaining predictors in the model. Another choice could be to select a model like a decision tree or neural network that can model features with nonlinear behaviors

  • HL Tests
    • The HL test on the model fit with the step() function shows that the model fit is not adequate. This is likely due to the fact that the predictors are mostly non-linear.
    • After adding splines the model fit shows that it is adequate

Step 4: Model Inferences

  • Task: Report p-values and confidence intervals for significant predictors and check for influential observations. Any influential observations should be removed and the model should be refit. Note any changes in the inferences due to the removal.

  • Deliverables:
    • Your code in Markdown should show how you used the function, produce the results, and indicate your final model
    • Your bullet points should give practical interpretation of the inferences and explain the process of checking for influential observations.

Confidence Intervals and p-values

conf_int <- confint(step_with_splines_model) %>% 
              data.frame() %>%
              rownames_to_column(var="Predictors")

p.values <- step_with_splines_model%>% 
              summary() %>%
              coef()%>% 
              .[,4] %>% 
              data.frame()%>%
              rownames_to_column(var="Predictors") 
colnames(p.values) <- c("Predictors", "p.values")
conf_int %>% left_join(p.values)
##                      Predictors     X2.5..    X97.5..     p.values
## 1                   (Intercept) -3.3205881  1.7165029 5.413388e-01
## 2           ns(citric.acid, 4)1 -0.7720941  0.4066690 5.435578e-01
## 3           ns(citric.acid, 4)2 -2.4797959 -0.4478897 4.908975e-03
## 4           ns(citric.acid, 4)3 -2.4027709  0.7404139 2.970513e-01
## 5           ns(citric.acid, 4)4 -2.0003567  3.4452966 6.176041e-01
## 6             ns(chlorides, 2)1 -3.8543374  0.1440308 7.095380e-02
## 7             ns(chlorides, 2)2 -4.6187578  0.9007006 2.135363e-01
## 8         ns(fixed.acidity, 2)1  1.5357446  4.6409308 1.000350e-04
## 9         ns(fixed.acidity, 2)2 -1.4643868  1.1913960 8.224173e-01
## 10  ns(free.sulfur.dioxide, 4)1 -0.5079563  1.6765334 2.911363e-01
## 11  ns(free.sulfur.dioxide, 4)2 -1.1239664  1.0868087 9.704847e-01
## 12  ns(free.sulfur.dioxide, 4)3 -1.8555339  2.9882415 6.371468e-01
## 13  ns(free.sulfur.dioxide, 4)4 -0.3978137  3.9818804 8.596518e-02
## 14 ns(total.sulfur.dioxide, 3)1 -3.4966968 -1.2519440 3.499755e-05
## 15 ns(total.sulfur.dioxide, 3)2 -3.4254130  0.8721582 2.125156e-01
## 16 ns(total.sulfur.dioxide, 3)3 -4.5233351  1.5161565 2.243857e-01
## 17            ns(sulphates, 4)1  1.0769006  4.2157372 1.147396e-03
## 18            ns(sulphates, 4)2  1.3530764  4.0411745 9.196090e-05
## 19            ns(sulphates, 4)3  0.8300265  8.0500382 1.782228e-02
## 20            ns(sulphates, 4)4  0.2356723  4.6793212 2.108640e-02
## 21             volatile.acidity -4.4350368 -2.4208432 3.030894e-11
## 22              ns(alcohol, 4)1 -1.7647206  0.9364081 5.555788e-01
## 23              ns(alcohol, 4)2  1.3185135  3.9064505 6.660549e-05
## 24              ns(alcohol, 4)3 -2.1715741  4.1959644 5.225071e-01
## 25              ns(alcohol, 4)4  1.1333260  6.4132474 8.354436e-03

Influential Observations

Get Cooks Distances

CooksDistance <- cooks.distance(step_with_splines_model)
summary(CooksDistance)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 4.800e-07 5.527e-05 2.129e-04 7.395e-04 6.754e-04 8.246e-02

Count Cooks Distances above the suggested threshold of 1

sum(CooksDistance > 1)
## [1] 0

Count Cooks Distances above the other suggested threshold of 4/length(df)

sum(CooksDistance > 4/length(wine_red))
## [1] 0

Plot of the model fitted values (x-axis) versus the Cooks Distance values (y-axis). Any points lying far above the others should be investigated.

ggplot(
  data = data.frame(model_fitted_values = step_with_splines_model$fitted.values,
                    Cooks_Distance = CooksDistance),
  aes(x = model_fitted_values,y = Cooks_Distance)
) + geom_point()

Comments on Inferences and Influential Observations:

  • After Adding splines a fair ammount of the parameters have 95% confidence intervals that include 0. This seems to be an artifact of the splines and per spline there is genrally a parameter with a parameter CI that does not include 0

  • Influential Observations:
    • There are no influential observations by looking at the Cooks Distance with either of the suggested thresholds
    • A plot of the Cooks Distance and Model Fitted values shows a point with a cooks distance of ~.08 this distance is too small to be influential to the model though

5: Asses Predictive Power

  • Task: Summarize predictive/discriminatory power of the model with an ROC curve and plots of predicted probabilities.

  • Deliverables:
    • Your code in Markdown should show how you used the function, produce the results, and indicate your final model
    • Your bullet points should give a brief explanation of each plot and the interpretation for your model.

Plot ROC Curve

pred <- prediction(step_with_splines_model$fitted.values, wine_red$quality)
perf <- performance(pred, measure = "tpr", x.measure = "fpr") 
plot(perf, col=rainbow(10), main="ROC Curve")

Determine AUC

auc.tmp <- performance(pred,"auc"); 
auc <- as.numeric(auc.tmp@y.values)
cat(paste("The AUC for my selected model is:",round(auc,3)))
## The AUC for my selected model is: 0.839

Plot Predicted Probabilities by Predictor

plot(wine_red$fixed.acidity,step_with_splines_model$fitted.values)

plot(wine_red$volatile.acidity,step_with_splines_model$fitted.values)

plot(wine_red$citric.acid,step_with_splines_model$fitted.values)

plot(wine_red$chlorides,step_with_splines_model$fitted.values)

plot(wine_red$free.sulfur.dioxide,step_with_splines_model$fitted.values)

plot(wine_red$total.sulfur.dioxide,step_with_splines_model$fitted.values)

plot(wine_red$sulphates,step_with_splines_model$fitted.values)

plot(wine_red$alcohol,step_with_splines_model$fitted.values)

Comments on Plots and Model Interpretation:

  • Analysis of ROC Curve Plot and AUC
    • The ROC Curve is up and to the left with an AUC of ~.84 which suggests that the model fits the data well and we would classify the model as Excellent using the AUC interpretation shown in class.
    • The AUC shown here is on the same data that the model was trained on so it may overinflate the predictive power that the model actually has. To understand this more we would have to split the data into training and test sets then calculate the AUC for both.
  • Analysis of Plots of Fitted Values by Predictor
    • One of the most striking plots is the last one, Alcohol, as alcohol percentage increases generally our model will predict that the wine is of good quality.
    • Other plots like chlorides generally dont seem to be predictive for most wines but show that deviating off the .1 value generally leads to a wine with a lower predicted value

Conclusion

I built a model that describes the wine quality well enough. The Spline’s added to all but one predictors in the final model were invaluable. The ROC AUC jumped from ~0.17 to ~0.84 due to their addition. Many of the predictors seem to behave with higher order terms which were modeled with splines but it would have been interesting to look at higher order or interaction terms between the variables. There are some other objective predictors I would be interested in adding such as price per bottle, age of the wine, type of grapes used and wine brand information. Sadly all this information is not public with the dataset. I would be interested in getting data about the public’s scores for the wines tested and see if those scores generally track the scores that the experts gave each wine.

A Dataset like this could be used to help predict wines and characteristics of wines that will be successful on the market before any human even tests them. This can help wineries to produce better quality wines that taste better. Also if an individual tasted and scored many of these wines their individual preferences could be encoded into a model that could help them to select similar wines to wines that they have enjoyed before.